Abstract

Proteins can sometimes be knotted, and for many reasons the study of knotted proteins is 
rapidly becoming very important. For example, it has been proposed that a knot increases 
the stability of a protein. Knots may also alter enzymatic activities and enhance binding. 
Moreover, knotted proteins may even have some substantial biomedical significance in relation 
to illnesses such as Parkinson's disease. But to a large extent the biological role of knots 
remains a conundrum. In particular, there is no explanation why knotted proteins are so 
scarce. Here we argue that knots are relatively rare because they tend to cause swelling in 
proteins that are too short, and presently short proteins are over-represented in the Protein 
Data Bank (PDB). Using Monte Carlo simulations we predict that the figure-8 knot leads to 
the most compact protein configuration when the number of amino acids is in the range of 
200 to 600. For the existence of the simplest knot, the trefoil, we estimate a theoretical upper 
bound of 300 to 400 amino acids, in line with the available PDB data. 
1. INTRODUCTION 

There are currently only some 300 known knotted proteins [1]-[7] listed in the 
Protein Data Bank (PDB) [8]. Furthermore, most of them correspond to the same protein 
in different species. Others appear in multiple domain proteins, often with the same 
knot repeated in each of the domains. When we consider only single domain proteins, 
there are no more than 17 different known knotted proteins [6]. With the single 
exception of a figure-8 knot, these are all trefoil knots, with the number N of 
central carbons in the range N ≈ 82-380. Even though there are examples of other 
knots such as the twist-3 knot, these have only been found in multiple domain 
proteins. In the present article our goal is to identify some universal characteristics 
of knotted proteins and to employ these to make predictions on the existence of 
knots. In particular, we look for explanations of why knots are so rare in the PDB data. 

Biologically active proteins are compact objects, and one might expect that compactness 
is in some (yet unknown) manner important for their function. Consequently we 
propose that a study of the relationship between knottedness and compactness could 
help to understand why knotted proteins are rare. Compactness can be measured by 
the Hausdorff dimension d_H of the protein backbone, which can be determined from the 
scaling properties of the radius of gyration R_g. In the limit where the number N of 
central carbon atoms becomes very large, R_g obeys the scaling law 

$$ R_g = \sqrt{ \frac{1}{2N^2} \sum_{i,j} \left( \mathbf{r}_i - \mathbf{r}_j \right)^2 } \approx L N^{\nu} \tag{1.1} $$

with r_i (i = 1, 2, ..., N) the space coordinates of the central carbons. Here L is a 
dimensionful swelling factor that sets the scale for the size of the protein, and the 
inverse Hausdorff dimension ν = 1/d_H is called the compactness index. The swelling 
factor L is not a universal quantity. But ν is universal: different values of ν characterize 
different universality classes (phases) of proteins. Biologically active proteins that have 
N in the range 100 < N < 1000 obey the scaling law (1.1) with ν ≈ 0.378 [10]. This 
is very close to the value ν = 1/3 that determines the universality class of fully collapsed 
proteins (solid matter); the difference is presumably due to some yet to be understood 
finite scaling effects [11]. 

FIG. 1: Least square linear fit of radius of gyration R_g versus N for the 15 single domain 
knotted proteins in PDB, as described in the text. 
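The radius of gyration that enters the scaling law can be computed directly from the central carbon coordinates. The following is a minimal sketch, assuming the standard definition of R_g as a pairwise double sum, R_g^2 = (1/(2N^2)) Σ_{i,j} |r_i - r_j|^2, which is equivalent to the mean squared distance from the centroid. The function names and the toy coordinates are hypothetical and are not taken from any PDB entry.

```python
import math

def radius_of_gyration(coords):
    """Centroid form: R_g^2 = (1/N) * sum_i |r_i - r_mean|^2."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    sq = sum(sum((p[k] - centroid[k]) ** 2 for k in range(3)) for p in coords)
    return math.sqrt(sq / n)

def radius_of_gyration_pairwise(coords):
    """Pairwise form: R_g^2 = (1/(2 N^2)) * sum_{i,j} |r_i - r_j|^2."""
    n = len(coords)
    total = 0.0
    for p in coords:
        for q in coords:
            total += sum((p[k] - q[k]) ** 2 for k in range(3))
    return math.sqrt(total / (2 * n * n))

# Hypothetical toy "backbone": four points on a square (illustration only).
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
rg_centroid = radius_of_gyration(toy_chain)
rg_pairwise = radius_of_gyration_pairwise(toy_chain)
```

The two forms agree up to rounding. The centroid form costs O(N) operations, whereas the pairwise form costs O(N^2), so the former is the practical choice for long chains.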



2. ANALYSIS OF PDB DATA 

In Figure 1 we display the radius of gyration for the 17 known single domain knotted 
proteins that are presently listed in PDB, versus the number N of their central carbons. 
When we perform a least square linear fit to this data we find that the result is distorted 
by the shortest known knotted protein, 2efv, for which N = 82 [8]. In order to have a 
meaningful fit we proceed by leaving out 2efv. We also leave out 1ztu since it is the 
only known figure-8 protein. 

We are then left with the 15 trefoil proteins presently in PDB, and for these N is in 
the range N ∈ [147, 380]. For these trefoils we obtain the following least square linear 
fit for the radius of gyration, 

$$ R_g^{\mathrm{trefoil}} \approx L N^{\nu} \approx (2.499 \pm 0.661) \cdot N^{[\ldots]} \tag{2.1} $$

We wish to compare this to the least square linear fit of R_g to all proteins presently 
in PDB. However, in order to ensure compatibility with (2.1), we choose only those 
proteins for which N > 125. For these we get 

$$ R_g \approx L N^{\nu} \approx (2.142 \pm 0.03) \cdot N^{0.387 \pm 0.003} \tag{2.2} $$

but we note that if we include all proteins in PDB we obtain the estimate 

$$ R_g \approx L N^{\nu} \approx (2.254 \pm 0.021) \cdot N^{0.378 \pm 0.002} $$

Despite slight differences in the numerical values, which are due to finite scaling effects, 
these two fits for R_g are very close to each other over the range of interest N ∈ [125, 400]. 
In our analysis we shall use both of them, and in this way we hope to control finite scaling 
corrections that are due to proteins with small values of N. 
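A power law R_g ≈ L N^ν becomes a straight line in logarithmic variables, ln R_g = ln L + ν ln N, so L and ν can be estimated by an ordinary least square linear fit. The sketch below assumes that the fits are performed in this log-log form, which the text does not state explicitly. The function name is hypothetical, and the data points are synthetic: they are generated from the central values L = 2.142 and ν = 0.387 of the fit for N > 125 rather than taken from PDB.

```python
import math

def fit_power_law(ns, rgs):
    """Least-squares fit of ln(R_g) = ln(L) + nu * ln(N); returns (L, nu)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(r) for r in rgs]
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    nu = sxy / sxx                      # slope = compactness index
    swelling = math.exp(y_mean - nu * x_mean)  # intercept gives the swelling factor
    return swelling, nu

# Synthetic, noise-free data (illustration only, not PDB measurements).
lengths = [125, 175, 250, 325, 400]
radii = [2.142 * n ** 0.387 for n in lengths]
fitted_L, fitted_nu = fit_power_law(lengths, radii)
```

On noise-free input the fit recovers the generating parameters. With real data the scatter in R_g translates into the standard errors quoted alongside each fit.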



Since the quality of the data that underlies (2.1) is poor, one should be careful in drawing 
conclusions, and with this in mind we observe the following: 

Unknotted proteins have a clearly smaller value of L than trefoils. Thus for small N 
the unknotted proteins have a tendency to be more compact (smaller) than trefoil 
proteins. When size matters this could explain why there are so few trefoils for small 
values of N in the PDB data: the trefoil knot causes increased swelling in small proteins. 
But when N grows, the swelling caused by the trefoil knot starts to diminish. When 
N reaches a value N_c ≈ 400-500, i.e. close to the upper bound N = 400 of our range, 
comparison between (2.1) and (2.2) predicts that the trefoil proteins become as 
compact as the unknotted ones. (The exact value of N_c varies slightly depending on 
how we account for the finite scaling effects due to short proteins.) For N > N_c ≈ 
400-500 our data for trefoil proteins is unreliable. But if the tendency continues 
there should be a range of values above N_c where the presence of a trefoil improves 
compactness over proteins without knots. In this range we expect that the relative 
number of trefoil proteins increases. 

FIG. 2: Probability distributions of the generalized gamma form p(N) ∝ N^a exp(-b N^c), fitted 
to the number density of all proteins (red line) and single domain trefoil proteins (blue line); 
the latter are displayed in Figure 1. 



Note that asymptotically, for very large values of N, the scaling law (1.1) should be 
insensitive to the presence of a single localized knot. 

In order to verify the reasonableness of the previous conclusion, and in particular 
whether there are additional reasons for the scarcity of knots, we consider Figure 2 where 
we display the (probability) density function for the number of proteins in PDB as a 
function of the length N, both for all proteins and for the single domain trefoil proteins. 
For all proteins the probability density peaks at around N = 90 while for trefoil proteins 
the peak is near N = 250, where the total number of resolved protein structures in PDB is 
already relatively small. In particular, we observe that there are relatively few resolved 
protein structures with N larger than about 400-500, where our estimates predict that trefoil 
proteins might become less swollen than unknotted ones. Consequently the small number 
of trefoil knots could be partly due to the scarcity of data in PDB. This is in line 
with our observation on the behaviour of (2.1), (2.2) for N > 400. 

In addition, since the probability density of trefoils is sharply peaked within the range 
N ≈ 150-400, a partial explanation could also be that there is some yet unknown reason 
why proteins with trefoil knots prefer these numbers of central carbons. 
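The peak positions read off from Figure 2 can be related to the parameters of the fitted density. Writing the generalized gamma form as p(N) ∝ N^a exp(-b N^c), where the exponent names a and c are assumed here for illustration, setting the derivative of ln p to zero gives a peak at N* = (a/(b c))^(1/c). The sketch below evaluates this. The parameter values are hypothetical and are not the fitted values behind Figure 2, which are not quoted in the text.

```python
import math

def generalized_gamma_unnormalized(n, a, b, c):
    """Unnormalized density N**a * exp(-b * N**c)."""
    return n ** a * math.exp(-b * n ** c)

def generalized_gamma_mode(a, b, c):
    """Peak position: d/dN [a ln N - b N**c] = 0  =>  N* = (a / (b c))**(1/c)."""
    return (a / (b * c)) ** (1.0 / c)

# Hypothetical parameters (illustration only): a = 2, b = 0.02, c = 1.
peak = generalized_gamma_mode(2.0, 0.02, 1.0)
at_peak = generalized_gamma_unnormalized(peak, 2.0, 0.02, 1.0)
below = generalized_gamma_unnormalized(peak - 5.0, 2.0, 0.02, 1.0)
above = generalized_gamma_unnormalized(peak + 5.0, 2.0, 0.02, 1.0)
```

For a and b c both positive the density has a single maximum, so comparing the peak positions of the two fitted curves is a meaningful summary of where each population is concentrated.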



3. THEORETICAL ESTIMATES 

In order to better understand the effect of knots on protein swelling, and in particular 
to clarify whether the presence of trefoil knots tends to decrease or increase the 
compactness of proteins when N > 400, we have theoretically investigated how different 
knots influence the radius of gyration of native state proteins when N is within the 
range 125 < N < 800. For this we have employed the Landau-Ginzburg model of protein 
folding that we have described in [10]. Since the model provides a good description of 
the universal aspects of protein folding, in particular for the mostly-α and α/β families of 
proteins that have been found to support knots [6], we expect it to have good predictive 
power on the effects of knots on protein swelling. 

We have performed extensive Monte Carlo simulations with two different sets of 
knotted proteins. We summarize the results in Table 1. The first set consists of chains 
that have either one, three or five distinct trefoil knots along the backbone. The second 
set consists of chains that have either the trefoil knot (3_1), the figure-8 
knot (4_1), or the twist-3 knot (5_2) along their backbone. The simulations have been 
performed by selecting up to 10 different values of N in the ranges displayed in Table 
1, with the exact values depending on the knot complexity, and then 
performing around 80 independent runs at each of these values of N to compute the radius 
of gyration; see Figure 3. The initial configuration is a (relatively tight) knot located 
deep inside the protein structure. In each of the cases we find that the least square linear 
fit of (1.1) provides an excellent match for the data; we have not been able to identify 
any kind of systematic nonlinear corrections. But in order to eliminate the influence of 
finite scaling effects that are due to short unknotted backbones, following (2.1), (2.2) we 
compare the knotted proteins to unknotted ones using a set of different ranges of values 
of N for the latter. These are given in Table 2. 

FIG. 3: Least square linear fits of R_g for the trefoil (3_1), figure-8 (4_1) and twist-3 (5_2) proteins. 

We observe from Table 2 that the small finite scaling corrections tend to systematically 
decrease the value of L and increase the value of ν. Already for the range 
250 < N < 1000 we find that ν is very close to its theoretical value ν = 1/3 corresponding 
to totally collapsed proteins. 

In the case of the first set we compare the unknotted proteins with proteins that have 
either one, three or five trefoil knots along their backbone. 

For proteins with a single trefoil (3_1), we find that there is a tendency for long trefoil 
proteins to be more swollen than the unknotted ones. We estimate that a transition 
occurs at around N_c ≈ 300, and for N < N_c our simulations predict that trefoils are 
(slightly) more compact than unknots. Since we estimate that there is a minimum 
length of about 50-70 central carbons for a trefoil knot to form, our simulations suggest 
that deep trefoil knots should predominantly be present for values N ≈ 100-300. 
This conclusion is in line with the PDB data in Figure 1 over its range of validity, and 
consistent with the probability density displayed in Figure 2. 

knot type     L       ν       ΔL       Δν       N_knot    N range
single 3_1    2.719   0.378   ±0.072   ±0.013   50-70     125-550
three 3_1     2.552   0.395   ±0.109   ±0.019             225-550
five 3_1      2.500   0.403   ±0.167   ±0.027             350-800
4_1           2.729   0.373   ±0.109   ±0.018   70-100    225-625
5_2           2.764   0.372   ±0.084   ±0.014   90-110    225-625

TABLE I: Swelling factor L and compactness index ν, with corresponding standard errors ΔL 
and Δν and an estimate for the average length of knots in the number of central carbons N_knot, 
for different knot types simulated using the model described in [10]. The last column gives 
the range of values of N for which the simulations have been performed. The lower bound is 
selected to accommodate a deep knot. 

We find that the presence of several trefoils along the backbone clearly increases 
swelling for all values of N we have studied. Consequently our simulations suggest that 
these configurations should be very rare. Indeed, in PDB data multiple trefoil knots 
have until now only been observed in multiple domain proteins, with only a single knot 
in each of the independent domains. 

In the case of our second set we compare the unknotted proteins to proteins with two 
more complex knots, the figure-8 (4_1) knot and the twist-3 (5_2) knot along the protein 
backbone. 
range of N        L       ν       ΔL       Δν
75 < N < 1000     2.656   0.379   ±0.049   ±0.008
125 < N < 1000    3.103   0.353   ±0.079   ±0.013
175 < N < 1000    3.411   0.338   ±0.093   ±0.015
250 < N < 1000    3.522   0.333   ±0.128   ±0.02

TABLE II: Least square linear fits for L and ν with standard errors ΔL and Δν, computed 
for unknotted proteins with varying range for N. 

For the figure-8 knot in our range of N we find that the ensuing proteins are slightly 
more compact than the unknotted ones. We find an upper bound N_c ≈ 600 beyond which 
the figure-8 proteins start to become more swollen than the unknotted ones. Since we 
estimate that the minimum length of a figure-8 knot is close to 100 central carbons, we 
propose that these knots should be present in PDB data as deep knots, predominantly 
with N in the range between 150 and 600. Thus far only one has been found, 1ztu 
with N = 320 [6]. But as visible in Figure 2 there are also relatively few protein 
structures that have been resolved within this range of N. 

In the case of the twist-3 knot, we find that proteins with this knot are slightly more 
swollen than those with the figure-8 knot. But they appear more compact than unknots at least 
until N reaches a value N_c ≈ 500, beyond which they appear to swell more than unknots. 
Furthermore, our estimates suggest that there is a lower bound at around N ≈ 300 below 
which these protein knots begin to be more swollen than unknots. However, this estimate 
is to some extent plagued by finite scaling effects due to the presence of short unknotted 
proteins in the analysis. We note that the two known 5_2 knots [5] appear in multiple 
domain proteins, one (1xd3) in a domain with N = 228 and the other (2etl) in a domain 
with N = 223 central carbons. 

Finally, in the theoretically important N → ∞ limit the effect of a (localized) knot 
on the scaling law (1.1) must vanish. Thus the presence of a knot leads to an intricate 
non-linear finite scaling effect that remains to be understood in detail. 
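The transition lengths N_c quoted in this section can be cross-checked against the tabulated fit parameters. Two scaling laws L_1 N^ν_1 and L_2 N^ν_2 intersect at N_c = (L_2/L_1)^(1/(ν_1 - ν_2)). The sketch below evaluates this for the single trefoil, figure-8 and twist-3 entries of Table 1 against one unknotted fit from Table 2. The choice of the 250 < N < 1000 row as the unknotted reference is an assumption made here, since the text compares against several ranges, and the standard errors are ignored. The function and variable names are hypothetical.

```python
def crossover_length(l_knot, nu_knot, l_unknot, nu_unknot):
    """N at which l_knot * N**nu_knot equals l_unknot * N**nu_unknot."""
    return (l_unknot / l_knot) ** (1.0 / (nu_knot - nu_unknot))

# Central values of (L, nu) copied from Tables 1 and 2; errors are ignored.
UNKNOT = (3.522, 0.333)  # Table 2, row 250 < N < 1000 (assumed reference)
KNOTS = {
    "3_1": (2.719, 0.378),  # single trefoil
    "4_1": (2.729, 0.373),  # figure-8
    "5_2": (2.764, 0.372),  # twist-3
}

crossovers = {
    name: crossover_length(l, nu, *UNKNOT) for name, (l, nu) in KNOTS.items()
}
for name, n_c in crossovers.items():
    print(f"{name}: N_c ~ {n_c:.0f}")
```

With these central values the crossovers come out at roughly 300 for the trefoil, 600 for the figure-8 knot and 500 for the twist-3 knot, in line with the estimates quoted above. Because the exponents differ only in the second decimal, N_c is very sensitive to the fit errors, which is consistent with the caution expressed about finite scaling effects.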



4. CONCLUSIONS 



In summary, knots are examples of topologically nontrivial structures in proteins 
with potentially very high biomedical relevance. However, until now knots have been 
identified in only relatively few single domain proteins. In order to understand the reason 
we have analysed the data in PDB and conclude that in the case of short proteins the 
(trefoil) knots have a tendency to increase the swelling of the folded protein backbone. 
But when the backbone length increases, the swelling due to the trefoil seems to decrease, 
and from the PDB data we estimate that for backbones with N ≈ 400-500 central 
carbons the swelling due to the trefoil disappears. The reason why knots are so rare in 
PDB data would then be partially explained by the fact that until now the structures of 
only relatively short proteins have been reliably resolved. 

But in addition, it could be that for some reason trefoil knots only appear for N ≈ 
100-400. In order to resolve this and clarify what effect knots have on protein swelling, 
we have performed extensive Monte Carlo simulations using a Landau-Ginzburg model 
of protein folding. We find that for trefoil knots the swelling is minimal when N is 
within the range of 100-300. But in contrast to the interpolated PDB data, beyond 
N_c ≈ 300 we predict that trefoils tend to increase swelling. This prediction is consistent 
with Figure 2, which shows a clear peak of trefoils near N ≈ 250. In the case of multiple 
trefoil knots, we find increased swelling in all cases and consequently these configurations 
should remain quite rare unless the protein chains become much longer. 
But when we increase the knot complexity, we find that it leads to an improved 
compactness for a range of values of N. We note that by using a very different method, 
the authors of [12] arrived at a very similar conclusion. The effect appears to be most 
pronounced for the figure-8 knot, for which we predict an eventual relative increase in their 
number in PDB data as long as N is less than about 600 but large enough to support a 
deep knot. For twist-3 we predict an eventual relative increase in their number in PDB 
data, until N reaches a value of about 500. However, since the twist-3 appears to be (slightly) 
more swollen than figure-8, it should remain rarer. Since the numerical differences 
in swelling that are revealed in our simulations are quite small and our conclusions 
are based more on tendencies than on clear numerical differences, the effects of various 
biological and evolutionary factors may eventually turn out to be more dominant than 
swelling. More exhaustive numerical simulations that in particular take into account 
the detailed amino acid structures of the proteins are thus needed before more detailed 
predictions on swelling can be made. Furthermore, it remains a theoretical challenge to 
understand the universal structure of finite scaling corrections to the radius of gyration 
in the case of collapsed protein chains. 